Neural Networks and Deep Learning¶

History¶

NNs started with the motivation to build computers that could mimic how the brain works: to make computers that could think like humans.

This work actually started in the 1950s, then fell out of favor for a while (first AI winter, 1974-1980).

  • A period of greatly reduced interest and funding in AI research, mostly because of the perception that AI (in general, not just NNs) had failed to live up to the grandiose promises and objectives it had laid out.

NNs gained popularity again in the 1980s and early 1990s before falling out of favor again in the late 1990s.

  • Successes: handwritten digit recognition, used in check processing and in recognizing zip codes on envelopes.

  • Not necessarily any "failures," but was overtaken in popularity by probabilistic approaches.

Resurgence from around 2005, also "rebranded" with deep learning: nowadays these terms (NNs and deep learning) mean almost the same thing.

Since 2005, NNs/deep learning have revolutionized area after area of comp sci research.

  • Speech recognition came first (2005-2009). This field was initially dominated by hidden Markov models (HMMs), but is now dominated by LSTMs (long short-term memory networks), a deep learning technique.

    • An interesting fact is that prior to this era, speech recognition programs often had to be trained on individual speakers' voices.
  • Then came image recognition. In 2012, a CNN (convolutional neural network) called AlexNet achieved a top-5 error of 15.3% in the ImageNet 2012 Challenge, more than 10.8 percentage points lower than that of the runner-up. This was made feasible by GPUs (graphics processing units).

  • Then came text/NLP. Before about 2014/2015, most machine translation used statistical models, but now almost all of it is done with neural (network) models.

  • Many other areas have been revolutionized as well (medical imaging, advertising, climate change forecasting, musical playlist generation, chatbots, etc.).

Brain neurons¶

  • Even though modern neural networks have almost nothing to do with how the brain actually works, there was the early motivation to try to build software to replicate what was happening in the human brain. So it's still worthwhile to understand (a little bit) about how that works.

image.png (from Wikipedia's neuron entry)

  • Neurons can send information to other neurons through electrical impulses. Each neuron in the brain is connected to other neurons, and the set of neurons each one connects to can change over time.

  • Neurons receive electrical signals from their dendrites, and send signals out down their axon. The signals from the dendrites function as "inputs" and the neuron then can determine whether or not to send an "output" signal down the axon based on the inputs. The axon then connects to the dendrites of another neuron. This is greatly simplified, but this is the basis for human thought.

Artificial neural networks¶

  • In software, we will build a very simplified model of real-world neurons. Our neurons will receive inputs (as numbers), do some computations based on those numbers, and produce another number as output.

  • This output will serve as the input to one (or usually more) other neurons. We will often arrange a collection of neurons in layers, where each neuron in the layer performs the same computations.

image.png

Caveat¶

We really don't know how the brain works, and this model of a neuron is vastly simplified from what we know (or think) is happening in the brain.

Every few years neuroscientists learn more and more about what's actually happening, so even though neural networks can do really powerful things, it's not 100% clear that they are truly mimicking what is happening in our brains.

And most computer scientists are ok with that. People who do neural net research have moved away from trying to replicate exactly what is going on in the brain, especially with regards to modern NN engineering techniques, which are all based on what works best, not necessarily how our brain does it.

Why now?¶

What enabled the modern deep learning revolution? It was a combination of things:

  • We have tons of (digitized) data around that we never used to have. So many things are now recorded electronically that never used to be, and so machine learning algorithms can now harness that data to train models.

  • Similarly, we now have incredibly fast computer processors (CPUs and GPUs), which can train models incredibly quickly.

    • GPUs: Graphics Processing Units. A GPU is a specialized circuit that was originally designed for displaying graphics on your computer (and still is). It contains specialized circuits for doing 2D and 3D computations in parallel, which are the underlying computations needed for computer graphics (everything is based on matrix calculations). Because modern machine learning (and especially NNs/deep learning) is all based around matrix calculations, and can often be parallelized, these chips turned out to be extremely helpful for deep learning as well.

image.png

Example: Demand Prediction¶

  • Let's use an example we haven't seen before: "demand prediction."

  • Suppose we work for a store that sells clothing, and we want to predict whether a new T-shirt design will be a top seller or not. We could set this up as a classification problem, where we're trying to predict "yes" or "no" for whether this shirt will be a top seller. In the real world, we'd probably have lots of features of the shirt, but for the moment, assume we just have one feature, price.

image.png

  • If we were to set up this problem as a logistic regression problem, we would have our single feature (price) be $x_1$ and our model would be
$$f(x) = \frac{1}{1 + e^{-w \cdot x}}$$

where $x$ would be a vector of $x_0$ (the "fake" feature) and $x_1$ (price), and $w$ is a vector of weights.

  • In neural network terms, we're going to add a new term, called the activation, and equate that with $f(x)$ in this case:
$$a = f(x) = \frac{1}{1 + e^{-w \cdot x}}$$

The term comes from the idea of "activating" a neuron.

  • It turns out that a neural network is built up of many of these tiny logistic regression models, all encapsulated into an entity we'll call a neuron or a "unit."

    • So each neuron takes as input a price $x_1$, computes the formula for $a = f(x)$, and outputs that number (which we interpret as the probability of this shirt being a top seller.)
  • Imagine each neuron as its own little computer doing this calculation.

  • Modern neural networks consist of wiring these neurons/units together in different ways.
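A minimal sketch of a single sigmoid "neuron" for the one-feature example above. The weight values here are made up purely for illustration; a real network would learn them from data.

```python
import numpy as np

def neuron_activation(w, x):
    """Sigmoid 'neuron': dot product of weights and inputs, squashed into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.dot(w, x)))

# Hypothetical weights: w[0] pairs with the always-1 "fake" feature x0,
# w[1] pairs with the price feature x1.
w = np.array([4.0, -0.5])    # made-up values for illustration
x = np.array([1.0, 10.0])    # x0 = 1, price = 10
a = neuron_activation(w, x)  # interpreted as P(top seller)
```

With these made-up weights, $z = 4 - 0.5 \cdot 10 = -1$, so the activation is $g(-1) \approx 0.27$.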

Extending the example to multiple features¶

  • Imagine now we have four features: price, shipping cost, marketing (something indicating how much marketing has been done for this shirt), and material (something indicating the quality of the material used to make the shirt).

  • Additionally, suppose we also think that whether a shirt becomes a top seller or not relates to: (1) affordability (do people think they can afford this shirt), (2) awareness (do people know that the shirt exists), and (3) perceived quality (do people think the shirt is a high-quality product).

  • In some sense, our four features do not map directly onto the three factors that we believe influence whether a shirt will be a top seller. Instead:

    • Price and shipping affect affordability,
    • Marketing affects awareness, and
    • Material and price affect perceived quality.
  • We can connect up the features into a neural network like this:

image.png

  • We group neurons into layers. Each layer contains a set of neurons all performing the same mathematical computations, often (but not always) on the same set of features, and often (but not always) sending their output to the same following layer of neurons.

  • In the picture above, the affordability/awareness/perceived quality collection of neurons form a layer. The probability of being a top seller is also a layer all by itself.

  • The neural network above accepts 4 numbers as inputs (the input layer). Those 4 numbers are sent to the 3 neurons (or units, or nodes) in the middle layer. Each of those neurons does a computation and produces an activation value (which is itself a number). Those 3 numbers are then sent to the next layer, which is called the output layer, which here is only one neuron. The output layer does another computation and produces a final activation value, which is the result of the neural network.

image.png

  • In the picture above, we only connected some nodes of the input layer to some neurons of the middle layer. In practice, we often connect every input in a layer to all the neurons of the following layer.

  • The middle layer(s) of a neural net are sometimes called hidden layers. The reason is that while we can observe the input layer values and the correct output layer values from our training data, we cannot in general know ahead of time the "correct" values for the numbers at the hidden layers.

Key idea so far¶

  • Think of a neural network as a collection of individual logistic regression units. We know that a single logistic regression unit can only learn a function of a linear combination of its input features. However, because neural networks are arranged in layers, each layer computes new features from (non-linear transformations of) linear combinations of the previous layer's outputs. This layering idea, combined with the non-linear activation function (the sigmoid function), results in neural networks being able to learn more sophisticated functions than logistic regression can learn.

  • Furthermore, the multiple layers will allow automatic construction of more complicated features: some feature engineering is taken care of for us.

  • In fact, though we came up with "interpretable" features in our middle/hidden layer, one of the main ideas of neural networks is that we don't need to figure out the features of any middle/hidden layers ahead of time. The neural network will figure them out for us.


In general, neural networks can have any number of layers, and any number of neurons in any of the layers.

do this example on board¶

Let's say we want to solve the facial recognition problem.

image.png

What is each layer of the network doing?

image.png

First hidden layer finds straight lines.

2nd hidden layer finds eyes/noses/mouths

3rd hidden layer finds larger portions of faces.

It figures out the features all by itself.

One of the cool things about this particular network is how its features are set up: each layer only looks at certain sections of the image, not the whole image. The first layer looks at very small squares, and later layers look at larger and larger regions. In this way, the neural net learns to find features that can appear anywhere in the image.

for cars¶

Same thing happens for cars.

image.png

Math of each layer of a neural net¶

Recall what a single neuron is doing, mathematically. It receives a collection of inputs, let's assume they're in a vector $\boldsymbol{x}$. We assume, as before, that $x_0$ is always 1.

Each neuron also has a weight vector $\boldsymbol{w}$. These two vectors are the same length.

Each neuron computes the dot product $z =\boldsymbol{w} \cdot \boldsymbol{x}$. (We used that $z$ notation in logistic regression as well!).

Each neuron takes this dot product and then passes it through the sigmoid function, sometimes called the activation function, $g(z) = \dfrac{1}{1+e^{-z}}$.

So the complete computation is $a = g(z) = g(\boldsymbol{w} \cdot \boldsymbol{x}) = \dfrac{1}{1+e^{-\boldsymbol{w} \cdot \boldsymbol{x}}}$. This is a single number (a scalar).

image.png

image.png


Now imagine we have an entire layer of neurons. We have to expand our notation a bit.

Each neuron in the layer receives the entire input vector $\boldsymbol{x}$. But each neuron has its own set of weights, so now we have a collection of weight vectors, $\boldsymbol{w}_1, \boldsymbol{w}_2$, etc, one per neuron in the layer.

Similarly, each neuron produces its own activation value, and now since there's one per neuron, we will call them $a_1$, $a_2$, etc. But again, we can collect them into a vector $\boldsymbol{a}$.

image.png

By convention, we call the input layer "layer 0" and each subsequent layer gets one higher number (layer 1, layer 2, etc).

We will use a superscript number in square brackets to denote the variables at each layer.

So the first layer that does any computation is layer 1 (layer 0 is just the input features), and the weight vectors at this layer are now $\boldsymbol{w}_1^{[1]}, \boldsymbol{w}_2^{[1]}$, etc. The activation values are combined into the vector $\boldsymbol{a}^{[1]}.$
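The layer computation above can be sketched as a loop over the layer's weight vectors: each neuron dots its own weights with the shared input, and the results are collected into the activation vector. The weights and input below are made-up values for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_activations(weight_vectors, x):
    """Compute one activation per neuron in a layer.

    weight_vectors: one weight vector per neuron in the layer.
    x: the input vector (every neuron sees the same input).
    Returns the vector a of activation values.
    """
    return np.array([sigmoid(np.dot(w_j, x)) for w_j in weight_vectors])

# Hypothetical layer 1 with 3 neurons, each taking a 4-number input.
W1 = [np.array([0.1, -0.2, 0.3, 0.0]),
      np.array([0.5, 0.5, -0.5, 0.1]),
      np.array([-1.0, 0.0, 0.2, 0.4])]
x = np.array([1.0, 2.0, 0.5, -1.0])
a1 = layer_activations(W1, x)  # 3 activation values, one per neuron
```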

image.png

...or, equivalently:

image.png

Next, the output of layer 1 becomes the input to layer 2.

image.png

The final step of a neural network is optional, and depends on whether we want a binary prediction or a probability output.

If we want, we can put a threshold onto the final output unit of the neural network. This optional computation changes the output of a neuron from a value between 0 and 1 to exactly 0 or 1, depending on whether the value is at least 0.5.
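A minimal sketch of this optional thresholding step:

```python
def threshold(a, cutoff=0.5):
    """Convert a probability-like activation into a hard 0/1 prediction."""
    return 1 if a >= cutoff else 0

pred_yes = threshold(0.7)  # -> 1
pred_no = threshold(0.3)   # -> 0
```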

image.png

A more complex network¶

do on board, showing activation from layer 3 to layer 4.¶

image.png

image.png

General formula¶

$$\boldsymbol{a}_j^{[\ell]} = g \left( \boldsymbol{w}_j^{[\ell]} \cdot \boldsymbol{a}^{[\ell-1]} \right)$$

In the context of neural networks, we will often call $g$ the activation function. $g$ doesn't actually have to be the sigmoid function; it can technically be any function, but there are a few common activation functions that we use, the sigmoid function being one of them.

To make our notation consistent, we will also sometimes use $\boldsymbol{a}^{[0]}$ as another name for our input vector $\boldsymbol{x}$.

Making predictions (inference)¶

Remember, of course, that we want to use neural networks, like any machine learning model, to make predictions about data. In linear and logistic regression, we defined our models as functions $f$. In neural networks, it's a little bit more complicated to do this with a single function since we have multiple layers of neurons, each computing its own values, but we can certainly do it!

For a neural network with its output layer being called layer $L$, we define $f(x) = \boldsymbol{a}^{[L]}$, in other words, the output of the model is just the output of the output layer.

Of course, we compute that vector $\boldsymbol{a}^{[L]}$ iteratively, by starting with the input layer $\boldsymbol{a}^{[0]}=\boldsymbol{x}$, and moving forward through the layers until we reach the output layer, computing each layer of numbers along the way. For this reason, this is called the forward propagation algorithm.
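The iterative process just described can be sketched as a loop over the layers, where each layer's activations become the next layer's input. The tiny 2-layer network below (3 hidden neurons, 1 output neuron) uses made-up weights purely for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(layers, x):
    """Forward pass: feed x through each layer in turn.

    layers: a list with one entry per layer; each entry is a list of that
            layer's per-neuron weight vectors.
    Returns a^[L], the activation vector of the output layer.
    """
    a = x  # a^[0] = x
    for weight_vectors in layers:
        a = np.array([sigmoid(np.dot(w_j, a)) for w_j in weight_vectors])
    return a

# Hypothetical network: layer 1 has 3 neurons over 2 inputs,
# layer 2 (the output layer) has 1 neuron over those 3 activations.
layers = [
    [np.array([0.5, -0.5]), np.array([1.0, 1.0]), np.array([-0.3, 0.2])],
    [np.array([0.4, -0.6, 0.9])],
]
f_x = forward_propagation(layers, np.array([2.0, 1.0]))
```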

example with handwritten digit recognition¶

image.png

image.png

image.png

Vectorized version of forward propagation¶

Suppose we have a vector $\boldsymbol{a}$ representing the output from a layer of the neural network (or the input features $\boldsymbol{x}$).

If we want to compute what happens at the next layer, we know we can use the general formula

$$\boldsymbol{a}_j^{[\ell]} = g \left( \boldsymbol{w}_j^{[\ell]} \cdot \boldsymbol{a}^{[\ell-1]} \right)$$

However, the computation above defines each term of the $\boldsymbol{a}$ vector separately. In other words, this equation above is actually a bunch of equations, one per neuron $j$ in the layer.

To do all the computations for the equations at once, what we can do is define a weight matrix $\boldsymbol{W}$ like we did for logistic regression.

$$\begin{bmatrix} \leftarrow & \boldsymbol{w_1} & \rightarrow\\ \leftarrow & \boldsymbol{w_2} & \rightarrow\\ & \vdots & \\ \end{bmatrix}$$

Note that this matrix has the same number of rows as the number of neurons in the layer being computed, and the same number of columns as the number of neurons in the previous layer.

We can then calculate $\boldsymbol{a}$ all at once by:

$$\boldsymbol{z}^{[\ell]} = \boldsymbol{W}^{[\ell]}\boldsymbol{a}^{[\ell-1]}$$

$$\boldsymbol{a}^{[\ell]} = g(\boldsymbol{z}^{[\ell]})$$

where applying $g$ to a vector means evaluating $g$ on each item in the vector, separately.

Remember, to multiply $\boldsymbol{W}^{[\ell]}$ by $\boldsymbol{a}^{[\ell-1]}$, the number of columns of $\boldsymbol{W}^{[\ell]}$ must match the number of rows of $\boldsymbol{a}^{[\ell-1]}$. This should make sense because both of those quantities are the number of neurons in the previous layer of the network.

Furthermore, the resulting vector from that multiplication will have the number of rows of $\boldsymbol{W}^{[\ell]}$ and the number of columns of $\boldsymbol{a}^{[\ell-1]}$. These two quantities are, respectively, the number of neurons in the current layer, and simply 1, because each vector $\boldsymbol{a}$ is a column vector. So therefore $\boldsymbol{a}^{[\ell]}$ is a column vector with the number of entries matching the number of neurons.
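The shape bookkeeping above can be sketched with one matrix-vector product per layer. The two weight matrices below are made-up values matching the shapes just described: layer 1 is 3x2 (3 neurons over 2 inputs) and layer 2 is 1x3 (1 neuron over 3 hidden activations).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # applied elementwise to a vector

def forward_propagation_vectorized(weight_matrices, x):
    """Vectorized forward pass: one matrix-vector product per layer."""
    a = x  # a^[0] = x
    for W in weight_matrices:
        # W has one row per neuron in this layer and one column per neuron
        # in the previous layer, so the shapes always line up.
        z = W @ a
        a = sigmoid(z)
    return a

# Hypothetical weights for illustration only.
W1 = np.array([[0.5, -0.5],
               [1.0,  1.0],
               [-0.3,  0.2]])   # shape (3, 2): 3 neurons, 2 inputs
W2 = np.array([[0.4, -0.6, 0.9]])  # shape (1, 3): 1 neuron, 3 inputs
output = forward_propagation_vectorized([W1, W2], np.array([2.0, 1.0]))
```

Note that `output` has one entry per neuron in the output layer, exactly as the shape argument above predicts.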
